Preconditioned Spectral Descent for Deep Learning
نویسندگان
چکیده
Deep learning presents notorious computational challenges. These challenges include, but are not limited to, the non-convexity of learning objectives and estimating the quantities needed for optimization algorithms, such as gradients. While we do not address the non-convexity, we present an optimization solution that exploits the so far unused “geometry” in the objective function in order to best make use of the estimated gradients. Previous work attempted similar goals with preconditioned methods in the Euclidean space, such as L-BFGS, RMSprop, and ADAgrad. In stark contrast, our approach combines a non-Euclidean gradient method with preconditioning. We provide evidence that this combination more accurately captures the geometry of the objective function compared to prior work. We theoretically formalize our arguments and derive novel preconditioned non-Euclidean algorithms. The results are promising in both computational time and quality when applied to Restricted Boltzmann Machines, Feedforward Neural Nets, and Convolutional Neural Nets.
منابع مشابه
Preconditioned Spectral Descent for Deep Learning: Supplemental Material
Table 1: Parameter Settings for Learning RBMs. RMSprop parameters chosen to match [5]. SGD parameters chosen to match [26]. SSD and A-SSD stepsizes and geometries chosen to match [1]. The stepsize on W is given for the RBM. λ corresponds to the damping factor in the history terms in the ADA and RMS methods. Projections refers to the numbers of projections used in the Random SVD algorithm [9] in...
متن کاملPreconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks
Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is ...
متن کاملPreconditioned Stochastic Gradient Descent
Stochastic gradient descent (SGD) still is the workhorse for many practical problems. However, it converges slow, and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably. But many attempts in this direction either aim at solving specialized problems, or result in significantly more complicated methods than SGD. This paper proposes a new method t...
متن کاملStochastic Spectral Descent for Restricted Boltzmann Machines
Restricted Boltzmann Machines (RBMs) are widely used as building blocks for deep learning models. Learning typically proceeds by using stochastic gradient descent, and the gradients are estimated with sampling methods. However, the gradient estimation is a computational bottleneck, so better use of the gradients will speed up the descent algorithm. To this end, we first derive upper bounds on t...
متن کاملShampoo: Preconditioned Stochastic Tensor Optimization
Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates ...
متن کامل